Neural Networks - The Idea
MBAN - Winter 2024
Neural networks (NN) are a class of models inspired by the human brain. They are, however, parametric models, and they are used for supervised learning.
1 The Idea
1.1 Polynomial Regression
Let's start with a predictive model with a single input (covariate). The simplest model is a linear one:
\[ y \approx \alpha+\beta x \]
Since this model can be quite restrictive, we can obtain a more flexible one with polynomial regression:
\[ y \approx \alpha+\beta_1 x+\beta_2 x^2+\beta_3 x^3+\ldots=\alpha+\sum_{m=1}^M \beta_m x^m \]
The polynomial regression is based on fixed components, or bases: \(x, x^2, x^3, \ldots, x^M\).
Let us consider a realistic (simulated) sample:
n <- 200
set.seed(1)
x <- sort(runif(n))
y <- sin(12*(x + 0.2))/(x + 0.2) + rnorm(n)/2
df <- data.frame(y, x)
plot(x, y, main = "Simulated data", col = "grey")
We can fit a polynomial regression with M = 3:
ols <- lm(y ~ x + I(x^2) + I(x^3))
plot(x, y, main="Polynomial: M = 3", col= "grey")
lines(x, predict(ols), col = "blue", lwd = 3)
Now, we can think of the fitted line as a weighted sum of fixed components: \(\alpha+\beta_1 x+\beta_2 x^2+\beta_3 x^3\).
# Parts
first <- ols$coefficients[2]*x
second <- ols$coefficients[3]*x^2
third <- ols$coefficients[4]*x^3
yhat <- ols$coefficients[1] + first + second + third
# Plots
par(mfrow=c(1,3), oma = c(0,0,2,0))
plot(x, first, ylab = "y", col = "pink", main = "x")
plot(x, second, ylab = "y", col = "orange", main = expression(x^2))
plot(x, third, ylab = "y", col = "green", main = expression(x^3))
And their sum:
plot(x, y, ylab="y", col = "grey",
main = expression(y == alpha + beta[1]*x + beta[2]*x^2 + beta[3]*x^3))
lines(x, yhat, col = "red", lwd = 3)
mtext("Fixed Components",
outer=TRUE, cex = 1.5, col="olivedrab")
1.2 Polynomial to Neural Network
The artificial neural network replaces these fixed components with adjustable bases:
\[ f\left(\alpha_1+\delta_1 x\right), f\left(\alpha_2+\delta_2 x\right), \ldots, f\left(\alpha_M+\delta_M x\right) \]
where \(f(.)\) is an activation function. We can see the first simple ANN as nonlinear functions of linear combinations:
\[ \begin{gathered} y \approx \alpha+\beta_1 f\left(\alpha_1+\delta_1 x\right)+\beta_2 f\left(\alpha_2+\delta_2 x\right)+\beta_3 f\left(\alpha_3+\delta_3 x\right)+\ldots \\ =\alpha+\sum_{m=1}^M \beta_m f\left(\alpha_m+\delta_m x\right) \end{gathered} \]
Common choices for the activation function \(f(\cdot)\) include:
- The logistic (or sigmoid) function: \(f(x)=\frac{1}{1+e^{-x}}\);
- The hyperbolic tangent function: \(f(x)=\tanh (x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}\);
- The Rectified Linear Unit (ReLU): \(f(x)=\max (0, x)\);
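For reference, these activation functions can be written directly in base R (the function names sigmoid and relu below are our own; tanh is built in):

```r
# The three activation functions listed above, in base R
sigmoid <- function(x) 1 / (1 + exp(-x))   # logistic
relu    <- function(x) pmax(0, x)          # Rectified Linear Unit
# tanh() is built into R: tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

sigmoid(0)       # 0.5
tanh(0)          # 0
relu(c(-2, 3))   # 0 3
```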
Hence, adjustable components enable us to capture complex relationships with fewer components (a smaller M).
Let’s replace those fixed components \(x, x^2, x^3\) in our polynomial regression with \(f\left(\alpha_1+\delta_1 x\right)\), \(f\left(\alpha_2+\delta_2 x\right), f\left(\alpha_3+\delta_3 x\right)\).
The following code demonstrates the ability of a simple artificial neural network (ANN) with arbitrary parameters to capture the underlying signal, relative to the third-degree polynomial regression model. It defines an ANN function with sigmoid activation functions for three nodes (M = 3), arbitrary parameters a, b, beta, and an intercept (int). For each node, the code calculates the weighted input (z) using a and b, and then applies the sigmoid activation function to obtain the output (sig). The output is then multiplied by the corresponding beta value. The final output (yhat) is the sum of the intercept and the weighted outputs from all three nodes.
a    <- c(1.5, 9, 3)
b    <- c(-20, -14, -8)
beta <- c(15, 25, -40)
int  <- 3
ann <- function(a, b, beta, int) {
  # 1st sigmoid node
  z1 <- a[1] + b[1] * x
  sig1 <- 1 / (1 + exp(-z1))
  # 2nd sigmoid node
  z2 <- a[2] + b[2] * x
  sig2 <- 1 / (1 + exp(-z2))
  # 3rd sigmoid node
  z3 <- a[3] + b[3] * x
  sig3 <- 1 / (1 + exp(-z3))
  # Weighted sum of the node outputs plus the intercept
  int + beta[1] * sig1 + beta[2] * sig2 + beta[3] * sig3
}
yhat <- ann(a, b, beta, int)
plot(x, y, main = "ANN: M = 3", ylim = c(-5, 15))
lines(x, yhat, col = "red", lwd = 3)
For now, let's estimate these parameters with neuralnet.
library(neuralnet)
set.seed(2)
nn <- neuralnet(y ~ x, data = df, hidden = 3, threshold = 0.05)
yhat <- compute(nn, data.frame(x))$net.result
plot(x, y, main = "Neural Networks: M = 3")
lines(x, yhat, col = "red", lwd = 3)
Why did the neural network perform better than polynomial regression in the previous example? Again, adjustable components enable us to capture complex relationships. Let's delve a little deeper. Here is the weight structure of
\[ \begin{gathered} y \approx \alpha+\sum_{m=1}^3 \beta_m f\left(\alpha_m+\delta_m x\right) \\ =\alpha+\beta_1 f\left(\alpha_1+\delta_1 x\right)+\beta_2 f\left(\alpha_2+\delta_2 x\right)+\beta_3 f\left(\alpha_3+\delta_3 x\right) \end{gathered} \]
nn$weights[[1]]
[[1]][[1]]
[,1] [,2] [,3]
[1,] 1.26253 6.59977 2.504890
[2,] -18.95937 -12.24665 -5.700564
[[1]][[2]]
[,1]
[1,] 2.407654
[2,] 13.032092
[3,] 19.923742
[4,] -32.173264
plot(nn, rep = "best")
- Error: It represents the final error of the best-performing neural network after the training process has concluded. It's a snapshot of how well (or not) the model has learned the underlying pattern in the data.
- Steps: Here, 48113 is the number of steps (iterations) the algorithm took to reach its current state, under the constraints we provided (like the learning rate and threshold). It indicates how much work the network needed to reach its final form.
We used sigmoid (logistic) activation functions:
Node 1: \(\quad f(x)=\frac{1}{1+e^{-x}}=\frac{1}{1+e^{-(1.26253-18.95937 x)}}\).
Node 2: \(\quad f(x)=\frac{1}{1+e^{-x}}=\frac{1}{1+e^{-(6.599773-12.24665 x)}}\).
Node 3: \(\quad f(x)=\frac{1}{1+e^{-x}}=\frac{1}{1+e^{-(2.504890-5.700564 x)}}\).
We can calculate the value of each activation function by using our data, \(x\) :
X <- cbind(1, x)
# to 1st Node
n1 <- nn$weights[[1]][[1]][,1]
f1 <- nn$act.fct(X %*% n1)
# to 2nd Node
n2 <- nn$weights[[1]][[1]][,2]
f2 <- nn$act.fct(X %*% n2)
# to 3rd Node
n3 <- nn$weights[[1]][[1]][,3]
f3 <- nn$act.fct(X %*% n3)
par(mfrow = c(1,3), oma = c(0,0,2,0))
plot(x, f1, col = "pink", main = expression(f(alpha[1] + delta[1]*x)))
plot(x, f2, col = "orange", main = expression(f(alpha[2] + delta[2]*x)))
plot(x, f3, col = "green", main = expression(f(alpha[3] + delta[3]*x)))
mtext("Flexible Components",
outer = TRUE, cex = 1.5, col = "olivedrab")
Now we will go from these nodes to the "sink": \[ \begin{aligned} & \frac{1}{1+e^{-(1.26253-18.95937 x)}} \times 13.032092 \\ & \frac{1}{1+e^{-(6.599773-12.24665 x)}} \times 19.923742 \\ & \frac{1}{1+e^{-(2.504890-5.700564 x)}} \times (-32.173264) \end{aligned} \]
Finally, we will add these with a "bias", the intercept: \[ \begin{gathered} 2.407654 \\ +\frac{1}{1+e^{-(1.26253-18.95937 x)}} \times 13.032092 \\ +\frac{1}{1+e^{-(6.599773-12.24665 x)}} \times 19.923742 \\ +\frac{1}{1+e^{-(2.504890-5.700564 x)}} \times (-32.173264) \end{gathered} \]
# From Nodes to sink (Y)
f1w <- f1 * nn$weights[[1]][[2]][2]
f2w <- f2 * nn$weights[[1]][[2]][3]
f3w <- f3 * nn$weights[[1]][[2]][4]
# Results
yhat <- nn$weights[[1]][[2]][1] + f1w + f2w + f3w
plot(x, y, main = "ANN: M = 3")
lines(x, yhat, col = "red", lwd = 3)
2 Neural Networks - More inputs
With a set of covariates \(X=\left(1, x_1, x_2, \ldots, x_k\right)\), we have:
\[
\begin{gathered}
y \approx \alpha+\sum_{m=1}^M \beta_m f\left(\alpha_m+\mathbf{X} \delta_m\right) \\
=\alpha+\beta_1 f\left(\alpha_1+\delta_{11} x_{1 i}+\delta_{12} x_{2 i}+\cdots+\delta_{1 k} x_{k i}\right)+\ldots \\
+\beta_M f\left(\alpha_M+\delta_{M 1} x_{1 i}+\delta_{M 2} x_{2 i}+\cdots+\delta_{M k} x_{k i}\right)
\end{gathered}
\]
We will use the neuralnet package.
library(MASS)
data("Boston")
str(Boston)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
     crim                zn             indus            chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
rad tax ptratio black
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
sum(is.na(Boston))
[1] 0
normalizedBoston <- as.data.frame(scale(Boston))
set.seed(123) # For reproducibility
ind <- sample(1:nrow(normalizedBoston), size = 0.8 * nrow(normalizedBoston))
train_data <- normalizedBoston[ind, ]
test_data <- normalizedBoston[-ind, ]
The neuralnet package requires the inputs as a data frame and the targets (in this case, the medv variable for median house value) either as a separate vector or as part of the data frame. Here's how to prepare both:
train_targets <- train_data$medv
test_targets <- test_data$medv
train_data$medv <- NULL
test_data$medv <- NULL
We are ready to train the network.
library(neuralnet)
# Define the formula
formula <- as.formula(paste("train_targets ~", paste(names(train_data), collapse = " + ")))
# Train the neural network
set.seed(123) # For reproducibility
nn1 <- neuralnet(formula, data = cbind(train_targets, train_data), hidden = c(5), linear.output = TRUE)
In this example, hidden = c(5) defines a neural network with one hidden layer containing 5 neurons. linear.output = TRUE is used because this is a regression problem.
plot(nn1, rep = "best")
Prediction error:
mse.test1 <- mean((test_targets - predict(nn1, test_data))^2)
mse.test1
[1] 0.1894812
Is it good? Let's try more or fewer neurons.
nn2 <- neuralnet(formula, data = cbind(train_targets, train_data), hidden = c(4), linear.output = TRUE)
Warning: Algorithm did not converge in 1 of 1 repetition(s) within the stepmax.
What's happening? The warning Algorithm did not converge in 1 of 1 repetition(s) within the stepmax indicates that the optimization algorithm used by neuralnet did not converge to a solution within the maximum number of steps allowed (stepmax). Convergence means the algorithm has found a stable set of weights and biases for the neural network that are unlikely to change with additional training under the current settings.
This lack of convergence can occur for several reasons:
- Insufficient Training: The stepmax parameter (the maximum number of steps/iterations the training algorithm should run) might be too low, not giving the algorithm enough time to converge.
- Learning Rate Issues: The learning rate might be too high or too low, causing the algorithm to overshoot the minimum or progress too slowly, respectively.
- Complex Model Structure: The model's complexity (e.g., too many neurons or layers) relative to the data could make training difficult.
- Data Issues: Problems with the data, such as scaling issues (even though we have scaled the data, it is still something to check), or the dataset not being representative or informative enough for the task.
How to Address the Convergence Warning
- Increase stepmax: Providing the algorithm with more iterations may help. You can increase stepmax in the neuralnet function call.
- Adjust the Learning Rate: If the learning rate is adjustable in your setup (through parameters like learningrate.limit and learningrate.factor, depending on the algorithm used), try fine-tuning it.
- Simplify the Model: Reduce the complexity of the neural network by using fewer hidden layers or neurons.
- Reevaluate the Data: Ensure the data is appropriately preprocessed, including checking for outliers or errors that might affect training.
# Train the neural network with stepmax and learningrate parameters
set.seed(123) # Ensure reproducibility
nn2 <- neuralnet(formula, data = cbind(train_targets, train_data), hidden = c(4),
                 linear.output = TRUE, stepmax = 1e5, learningrate = 0.01)
plot(nn2, rep = "best")
mse.test2 <- mean((test_targets - predict(nn2, test_data))^2)
mse.test2
[1] 0.2314028
The neuralnet function in R is highly configurable, offering a wide array of arguments to tailor the training process to specific needs. Here's a breakdown of the main arguments:
2.1 Arguments
Core Arguments
- formula: A symbolic description of the model to be fitted. The format is response ~ terms, where response is the (numeric) dependent variable and terms are the independent variables or predictors.
- data: The data frame containing the variables specified in the formula.
Architecture and Training Configuration
- hidden: A vector indicating the number and size of hidden layers and neurons. For example, c(5, 3) means two hidden layers with 5 neurons in the first layer and 3 in the second.
- threshold: The threshold for the partial derivatives of the error function as a stopping criterion. Training stops when all partial derivatives are below this threshold.
- stepmax: Maximum number of training steps. The algorithm stops if this number is reached.
- rep: Number of times the neural network is trained. Multiple repetitions can help in finding a better local minimum since the training starts with different random weights each time.
Learning Rate Parameters
- startweights: Initial weights for the training process. NULL means that initial weights are chosen randomly.
- learningrate.limit: Numeric vector of length 2 providing the minimum and maximum limit for the learning rate. Only relevant for traditional backpropagation (algorithm = "backprop").
- learningrate.factor: A list with components minus and plus defining the factors by which the learning rate is decreased or increased. Relevant for resilient backpropagation (rprop+ or rprop-).
- learningrate: The learning rate for the weight updates. Relevant for backprop.
Output and Algorithm Control
- lifesign: Output type during training; can be "none", "minimal", or "full".
- lifesign.step: The frequency of the output defined by lifesign.
- algorithm: The algorithm used for training. Can be "rprop+", "rprop-", "backprop", or "slr".
- err.fct: The error function. Can be "sse" for sum of squared errors or "ce" for cross-entropy.
- act.fct: The activation function, typically "logistic" or "tanh".
- linear.output: Logical. If TRUE, the output neuron uses a linear activation function. Relevant for regression tasks.
- exclude: Variables to be excluded from the model.
- constant.weights: Allows setting some weights to a constant value that won't be changed during training.
- likelihood: If TRUE, the likelihood is returned in addition to the neural network's results.
How to Use These Arguments
The choice of these parameters can significantly affect the training process and the model's performance. Here's how to use them effectively:
- Adjust the architecture (hidden) based on the complexity of the problem.
- Set a reasonable stepmax to ensure the training process has enough iterations to converge but stops to prevent excessive computation.
- Use multiple rep to mitigate the risk of getting stuck in a poor local minimum.
- Fine-tune the learning rate (learningrate, learningrate.limit, learningrate.factor) to balance the speed and stability of convergence.
- Choose an appropriate algorithm based on your problem and preference for speed vs. accuracy.
- Experiment with different act.fct and err.fct to find the best combination for your specific task.
Selecting and tuning these parameters usually requires some experimentation, guided by the specific characteristics of your dataset and the problem you're solving.
2.2 More on Arguments
Epochs Explained
An epoch refers to one complete pass through the entire training dataset. If you have a dataset of 1,000 examples and you use all of them once for updating the weights, that counts as one epoch. Training neural networks often requires many epochs, meaning the entire dataset is used multiple times iteratively to update the network’s weights with the goal of minimizing the loss function.
Stepmax Explained
stepmax, on the other hand, refers to the maximum number of steps (iterations) the training process will perform. A step is a single update of the model’s weights. In gradient descent (and its variants), a step usually involves calculating the gradient of the loss function with respect to each weight, then adjusting the weights in the direction that minimizes the loss.
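To make the distinction concrete, here is a small counting sketch with hypothetical numbers: with 100 training examples and mini-batches of 20, one epoch contains 5 weight updates (steps). Note that neuralnet itself trains on the full dataset, so each of its steps uses all observations; the mini-batch split below is purely illustrative of how epochs and steps relate in general.

```r
# Hypothetical example: counting steps vs. epochs with mini-batch updates
n_obs <- 100; batch_size <- 20; n_epochs <- 3
steps <- 0
for (epoch in 1:n_epochs) {
  for (start in seq(1, n_obs, by = batch_size)) {
    steps <- steps + 1  # one weight update per mini-batch = one step
  }
}
steps  # 15: 3 epochs x 5 mini-batches per epoch
```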
What’s rep?
In the context of the neuralnet package in R, the rep parameter specifies the number of times the neural network will be trained from the beginning with different initial weights. Each training run (or repetition) starts with a new set of random weights and goes through the training process independently of the others.
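A sketch of how rep might be used with the simulated data from Section 1 (assuming the df built there): train several repetitions and keep the one with the lowest training error, which is stored per repetition in result.matrix.

```r
# Sketch (assumes df from Section 1): train 3 repetitions, keep the best
library(neuralnet)
set.seed(3)
nn_multi <- neuralnet(y ~ x, data = df, hidden = 3, rep = 3, threshold = 0.05)
best <- which.min(nn_multi$result.matrix["error", ])  # lowest-error repetition
yhat <- predict(nn_multi, df, rep = best)
```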
2.3 Type of Backpropagation
Resilient Backpropagation (Rprop) is an optimization algorithm used for training neural networks. It's designed to overcome some of the difficulties associated with the traditional gradient descent method, especially the problem of choosing its learning rate parameter. Gradient descent updates the weights of the network by moving them in the direction opposite to the gradient of the error function with respect to those weights, scaled by a factor known as the learning rate. However, finding a single learning rate that works well for the entire training process can be challenging.
Key Features of Rprop
- Adapts the Learning Rate Individually for Each Weight: Unlike traditional gradient descent, which uses a single learning rate for all weight updates, Rprop uses a separate update value (similar to a learning rate) for each weight in the network, allowing it to adaptively adjust how much each weight is updated during training.
- Only the Sign of the Gradient Is Used: Rprop considers only the sign of the gradient (whether it's positive or negative) to determine the direction in which to update the weights, ignoring the magnitude of the gradient. This makes training less sensitive to the steepness of the gradient, which can vary significantly across different weights.
- Update Values Adjust Based on Gradient Change: The algorithm increases the update value for a weight if the gradient of the error function with respect to that weight keeps the same sign (indicating a consistent direction in the error landscape), which speeds up convergence. Conversely, if the gradient sign changes (indicating an overshoot past the minimum), the update value is decreased to allow finer adjustments.
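The sign-based update rule can be illustrated on a one-dimensional toy problem. This is our own sketch, not the neuralnet internals: it minimizes f(w) = (w - 3)^2, growing the step by a factor of 1.2 while the gradient keeps its sign and halving it when the sign flips.

```r
# Illustrative Rprop-style update: minimize f(w) = (w - 3)^2, gradient 2*(w - 3)
grad <- function(w) 2 * (w - 3)

w <- 0; step <- 0.1; prev_sign <- 0
for (i in 1:200) {
  g <- sign(grad(w))
  if (g == prev_sign) {
    step <- min(step * 1.2, 1)        # same sign: grow the step (faster progress)
  } else if (prev_sign != 0) {
    step <- max(step * 0.5, 1e-6)     # sign flip (overshoot): shrink the step
  }
  w <- w - g * step                   # only the SIGN of the gradient is used
  prev_sign <- g
}
w  # close to the minimizer, 3
```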
Variants of Rprop
There are several variants of the resilient backpropagation algorithm, including:
- Rprop+ (Rprop Plus): Increases or decreases the update value based only on the change of sign in the gradient. If the gradient changes direction, indicating an overshoot, the previous update is undone.
- Rprop- (Rprop Minus): Similar to Rprop+, but without the weight backtracking step. It only adjusts the update values without reverting the last update.
- iRprop+ (Improved Rprop Plus): Addresses some issues in Rprop+ related to weight updates in the flat regions of the error function.
By adjusting the update values adaptively for each weight, Rprop can often converge faster than standard gradient descent, especially on problems where the gradient can vary significantly in different parts of the weight space. The algorithm is relatively simple to implement and does not require the fine-tuning of a global learning rate, making it appealing for many applications. In the neuralnet package in R, you can specify the use of the resilient backpropagation algorithm for training a neural network with the algorithm parameter:
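For example, a sketch using the Boston training objects built above (note that traditional backpropagation additionally requires an explicit learning rate):

```r
# Resilient backpropagation (the default) vs. traditional backpropagation
nn_rprop <- neuralnet(formula, data = cbind(train_targets, train_data),
                      hidden = c(5), algorithm = "rprop+", linear.output = TRUE)
nn_backprop <- neuralnet(formula, data = cbind(train_targets, train_data),
                         hidden = c(5), algorithm = "backprop",
                         learningrate = 0.01, linear.output = TRUE)
```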